Perceptual quality assessment of real-world images and videos
The development of online social-media venues and rapid advances in camera and mobile-device technology have led to the creation and consumption of a seemingly limitless supply of visual content. However, the vast majority of these digital images and videos are afflicted with annoying artifacts introduced during acquisition, storage, and transmission over the network. All of these factors degrade the quality of the visual media as perceived by a human observer, compromising the observer's quality of experience (QoE).
This dissertation focuses on constructing datasets that are representative of real-world image and video distortions and on designing algorithms that accurately predict the perceptual quality of images and videos. The primary goal of this research is to design and demonstrate automatic image and continuous-time video quality predictors that can effectively handle widely diverse authentic spatial, temporal, and network-induced distortions -- unlike present-day algorithms, which operate on single, synthetic distortions and predict a single overall quality score for a given video.
I introduce an image quality database containing a large number of images captured using a representative variety of modern mobile devices and afflicted with a widely diverse set of authentic distortions. I also describe the design of an online crowdsourcing system that supported a very large-scale subjective image quality assessment study. This data collection enabled the design of a new image quality predictor founded on the principles of natural scene statistics of images in different color spaces and transform domains. The new method can assess the quality of images with complex mixtures of distortions and correlates highly with human perception.
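The natural-scene-statistics approach mentioned above can be illustrated with mean-subtracted contrast-normalized (MSCN) coefficients, a standard NSS feature. The sketch below is a minimal, illustrative stand-in that assumes a single grayscale image; the actual predictor operates across multiple color spaces and transform domains, and all function names here are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mscn_coefficients(image, sigma=7/6, eps=1e-8):
    """Mean-subtracted contrast-normalized (MSCN) coefficients.

    MSCN coefficients of pristine images follow a near-Gaussian
    distribution; distortions alter that distribution, so its
    summary statistics can serve as quality features.
    """
    image = image.astype(np.float64)
    mu = gaussian_filter(image, sigma)  # local mean field
    # local standard deviation field (abs guards tiny negative values
    # from floating-point error)
    sd = np.sqrt(np.abs(gaussian_filter(image ** 2, sigma) - mu ** 2))
    return (image - mu) / (sd + eps)

def nss_features(image):
    """A few simple NSS summary statistics of the MSCN field."""
    mscn = mscn_coefficients(image)
    var = mscn.var()
    return {
        "mean": float(mscn.mean()),
        "variance": float(var),
        "kurtosis": float(((mscn - mscn.mean()) ** 4).mean() / var ** 2),
    }
```

In a full predictor, features like these would be computed in several color spaces and transform domains and fed to a learned regressor that maps them to a quality score.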
Pertaining to videos, this dissertation describes a video quality database created to understand the impact of network-induced distortions on an end user's quality of experience. I present the details of a large-scale subjective study that I conducted to gather continuous-time ground-truth QoE scores on a collection of 180 videos afflicted with diverse stalling events. I also analyze the temporal variations in perceived QoE caused by time-varying video quality and offer insights into the impact of relevant human cognitive factors, such as long-term and short-term memory and recency, on quality perception. Next, I present a continuous-time objective QoE prediction model that captures the complex interactions among these cognitive elements, spatial and temporal distortions, and the properties of stalling events, and that models the state of any given client-side network buffer. I also show how the proposed framework can be extended by supplementing it with additional inputs (or by eliminating ineffective ones), based on the information available to content providers when designing adaptive stream-switching algorithms. This QoE predictor supports future research on quality-aware stream-switching algorithms that could control the position and length of stalls, given a network bandwidth budget and the end user's device information, so that the end user's QoE is maximized.
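The client-side buffer state mentioned above can be sketched with a simple simulation: the buffer fills as chunks download and drains during playback, and a stall occurs whenever it runs dry. The function below is a hedged illustration under a constant-bandwidth assumption, not the dissertation's model; all names and parameters are hypothetical.

```python
def simulate_buffer(chunk_sizes, bandwidth, chunk_duration=1.0, startup=2.0):
    """Simulate a client-side playback buffer and locate stall events.

    chunk_sizes: bits per video chunk; bandwidth: constant bits/sec
    (a simplifying assumption). Playback starts once `startup` seconds
    are buffered. Returns a list of (start_time, length) stall events.
    """
    buffer = 0.0    # seconds of video currently buffered
    t = 0.0         # wall-clock time
    playing = False
    stalls = []
    for size in chunk_sizes:
        dl = size / bandwidth              # download time of this chunk
        if playing:
            drained = min(buffer, dl)      # video consumed while downloading
            buffer -= drained
            if drained < dl:               # buffer ran dry: a stall
                stalls.append((t + drained, dl - drained))
        t += dl
        buffer += chunk_duration           # chunk is now available to play
        if not playing and buffer >= startup:
            playing = True
    return stalls
```

A continuous-time QoE predictor could consume the resulting stall positions and lengths, alongside spatial/temporal quality signals, as model inputs.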
Activity Driven Weakly Supervised Object Detection
Weakly supervised object detection aims at reducing the amount of supervision
required to train detection models. Such models are traditionally learned from
images/videos labelled only with the object class and not the object bounding
box. In our work, we try to leverage not only the object class labels but also
the action labels associated with the data. We show that the action depicted in
the image/video can provide strong cues about the location of the associated
object. We learn a spatial prior for the object dependent on the action (e.g.
"ball" is closer to "leg of the person" in "kicking ball"), and incorporate
this prior to simultaneously train a joint object detection and action
classification model. We conducted experiments on both video datasets and image
datasets to evaluate the performance of our weakly supervised object detection
model. Our approach outperformed the current state-of-the-art (SOTA) method by
more than 6% in mAP on the Charades video dataset.
Comment: CVPR'19 camera ready
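The action-dependent spatial prior described above (e.g. "ball" near "leg" for "kicking ball") can be sketched as a per-(action, object) Gaussian over normalized box-centre coordinates. This is an illustrative stand-in with assumed class and method names; in the paper the prior is learned jointly with the detection and action-classification model.

```python
import numpy as np

class ActionSpatialPrior:
    """Per-(action, object) 2D Gaussian prior over box centres.

    Fit on weakly labelled data with approximate object locations in
    normalized [0, 1]^2 image coordinates; at test time it scores
    candidate boxes by prior likelihood. Illustrative only.
    """
    def __init__(self):
        self.params = {}  # (action, obj) -> (mean, variance)

    def fit(self, action, obj, centres):
        c = np.asarray(centres, dtype=np.float64)
        # small floor on the variance keeps the score well-defined
        self.params[(action, obj)] = (c.mean(axis=0), c.var(axis=0) + 1e-6)

    def score(self, action, obj, centre):
        mean, var = self.params[(action, obj)]
        d = np.asarray(centre, dtype=np.float64) - mean
        # unnormalized log-likelihood under an axis-aligned Gaussian
        return float(-0.5 * np.sum(d * d / var))
```

Candidate boxes whose centres score higher under the prior for the labelled action would be favoured during weakly supervised training.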
How2Sign: A Large-scale Multimodal Dataset for Continuous American Sign Language
One of the factors that have hindered progress in the areas of sign language
recognition, translation, and production is the absence of large annotated
datasets. Towards this end, we introduce How2Sign, a multimodal and multiview
continuous American Sign Language (ASL) dataset, consisting of a parallel
corpus of more than 80 hours of sign language videos and a set of corresponding
modalities including speech, English transcripts, and depth. A three-hour
subset was further recorded in the Panoptic studio enabling detailed 3D pose
estimation. To evaluate the potential of How2Sign for real-world impact, we
conduct a study with ASL signers and show that synthesized videos using our
dataset can indeed be understood. The study further gives insights on
challenges that computer vision should address in order to make progress in
this field.
Dataset website: http://how2sign.github.io/
Comment: Accepted at CVPR 2021.
Beyond web-scraping: Crowd-sourcing a geographically diverse image dataset
Current dataset collection methods typically scrape large amounts of data
from the web. While this technique is extremely scalable, data collected in
this way tends to reinforce stereotypical biases, can contain personally
identifiable information, and typically originates from Europe and North
America. In this work, we rethink the dataset collection paradigm and introduce
GeoDE, a geographically diverse dataset with 61,940 images from 40 classes and
6 world regions, and no personally identifiable information, collected through
crowd-sourcing. We analyse GeoDE to understand differences in images collected
in this manner compared to web-scraping. Despite the smaller size of this
dataset, we demonstrate its use as both an evaluation and training dataset,
highlight shortcomings in current models, as well as show improved performances
when even small amounts of GeoDE (1000 - 2000 images per region) are added to a
training dataset. We release the full dataset and code at
https://geodiverse-data-collection.cs.princeton.edu
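Adding a fixed number of images per region, as described above, amounts to region-stratified sampling. The helper below is a minimal sketch under the assumption that each record carries a "region" field; the function and field names are hypothetical, not part of the GeoDE release.

```python
import random

def region_balanced_subset(records, per_region=1000, seed=0):
    """Sample up to `per_region` records from each world region.

    records: iterable of dicts with a "region" key (field name is an
    assumption for illustration). Returns a shuffled subset suitable
    for appending to an existing training set.
    """
    by_region = {}
    for rec in records:
        by_region.setdefault(rec["region"], []).append(rec)
    rng = random.Random(seed)  # fixed seed for reproducibility
    subset = []
    for items in by_region.values():
        rng.shuffle(items)
        subset.extend(items[:per_region])
    rng.shuffle(subset)
    return subset
```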